Controlling resource transfers in a logically partitioned computer system

ABSTRACT

A resource and partition manager of the preferred embodiments includes a lock mechanism that operates on a plurality of locks that control access to individual I/O slots. The resource and partition manager uses the lock mechanism to obtain a lock on an I/O slot when transferring control of the I/O slot to a logical partition that is powering on and when removing the I/O slot from a logical partition that is powering off. The resource and partition manager uses the lock mechanism to remove control of an I/O slot from, or return control to, an operating logical partition in order to facilitate hardware service operations on that I/O slot or on the physical enclosure in which it is contained.

CROSS-REFERENCE TO PARENT APPLICATION

This patent application is a continuation of “Apparatus and Method for Controlling Resource Transfers in a Logically Partitioned Computer System”, U.S. Ser. No. 11/374,883 filed on Mar. 14, 2006, which is incorporated herein by reference.

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention generally relates to data processing, and more specifically relates to allocation of shared resources in a computer system.

2. Background Art

Since the dawn of the computer age, computer systems have evolved into extremely sophisticated devices that may be found in many different settings. Computer systems typically include a combination of hardware (e.g., semiconductors, circuit boards, etc.) and software (e.g., computer programs). As advances in semiconductor processing and computer architecture push the performance of the computer hardware higher, more sophisticated computer software has evolved to take advantage of the higher performance of the hardware, resulting in computer systems today that are much more powerful than just a few years ago.

The combination of hardware and software on a particular computer system defines a computing environment. Different hardware platforms and different operating systems thus provide different computing environments. In recent years, engineers have recognized that it is possible to provide different computing environments on the same physical computer system by logically partitioning the computer system resources to different computing environments. The iSeries computer system developed by IBM is an example of a computer system that supports logical partitioning. If logical partitioning on an iSeries computer system is desired, resource and partition manager code (referred to as a “hypervisor” in iSeries terminology) is installed that allows defining different computing environments on the same platform. Once the resource and partition manager is installed, logical partitions may be created that define different computing environments. The resource and partition manager manages the logical partitions to assure that they can share needed resources in the computer system while maintaining the separate computing environments defined by the logical partitions.

A computer system that includes multiple logical partitions typically shares resources between the logical partitions. For example, a computer system with two logical partitions could be defined that allocates 50% of the CPU to each partition, that allocates 33% of the memory to the first partition and 67% of the memory to the second partition, and that allocates two different I/O slots to the two logical partitions, one per partition. Once logical partitions are defined and shared resources are allocated to the logical partitions, each logical partition acts as a separate computer system. Thus, in the example above that has a single computer system with two logical partitions, the two logical partitions will appear for all practical purposes to be two separate and distinct computer systems.

One problem with known logically partitioned computer systems occurs when hardware resources need to be transferred between logical partitions. For example, if a PCI slot in a first logical partition needs to be transferred to a second logical partition, the PCI slot must first be removed from the first logical partition, and the PCI slot can then be allocated to the second logical partition. Note, however, that once the PCI slot has been removed from the first logical partition, in the prior art two logical partitions might compete for control of the PCI slot at the same time. In addition, when a PCI slot is allocated to a different logical partition, it may contain data from the previous logical partition that could be compromised under certain circumstances. Furthermore, the PCI slot may be configured in a particular state suitable for the first logical partition, which is not necessarily suitable for the second logical partition. Without a way to dynamically transfer I/O resources in a logically partitioned computer system without the drawbacks known in the art, the computer industry will continue to suffer from potentially insecure and inefficient mechanisms and methods for performing I/O resource transfers in logically partitioned computer systems.

BRIEF SUMMARY OF THE INVENTION

A resource and partition manager of the preferred embodiments includes a lock mechanism that operates on a plurality of locks that control access to individual I/O slots. The resource and partition manager uses the lock mechanism to obtain a lock on an I/O slot when transferring control of the I/O slot to a logical partition that is powering on and when removing the I/O slot from a logical partition that is powering off. The resource and partition manager uses the lock mechanism to remove control of an I/O slot from, or return control to, an operating logical partition in order to facilitate hardware service operations on that I/O slot or on the physical enclosure in which it is contained. The preferred embodiments also include methods for releasing system resources and address bindings allocated to an I/O slot when control of the I/O slot is removed from a logical partition, and methods for allocating and initializing system resources when control of an I/O slot is transferred to a logical partition. In addition, the preferred embodiments include the use of the locks and related mechanisms to transfer I/O slots from the logical partitions to the resource and partition manager, and later back to the logical partitions, for the purpose of performing hardware service operations on an I/O slot or the components of a physical enclosure containing these I/O slots.

The foregoing and other features and advantages of the invention will be apparent from the following more particular description of preferred embodiments of the invention, as illustrated in the accompanying drawings.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWING

The preferred embodiments of the present invention will hereinafter be described in conjunction with the appended drawings, where like designations denote like elements, and:

FIG. 1 is a block diagram of a computer apparatus that supports logical partitioning and I/O resource allocation in accordance with the preferred embodiments;

FIG. 2 is a more detailed block diagram showing one specific hardware implementation that may be used in a logically partitioned computer system in accordance with the preferred embodiments;

FIG. 3 is a block diagram of a specific logically partitioned software implementation that could be implemented on the hardware system shown in FIG. 2 in accordance with the preferred embodiments;

FIG. 4 is a flow diagram of a method for rebooting a logical partition in accordance with the preferred embodiments;

FIG. 5 is a flow diagram of a method for shutting down a logical partition in accordance with the preferred embodiments;

FIG. 6 is a flow diagram of a method for powering up a logical partition in accordance with the preferred embodiments;

FIG. 7 is a flow diagram of a method for a logical partition to relinquish control of a slot it owns in accordance with the preferred embodiments; and

FIG. 8 is a flow diagram of a method for a logical partition to regain control of a slot it owns in accordance with the preferred embodiments.

DETAILED DESCRIPTION OF THE INVENTION

According to preferred embodiments of the present invention, hardware resources, such as I/O slots, in a logically partitioned computer system may be allocated to a logical partition and removed from a logical partition. This allows hardware resources to be transferred in a logically partitioned computer system. Locks are defined that correspond to each hardware resource. To gain access to a hardware resource, a logical partition must acquire ownership of the lock corresponding to the hardware resource. A power on/power off mechanism assures that a hardware resource is powered off when the hardware resource is removed from a logical partition, and assures that a hardware resource is powered up when the hardware resource is allocated to a logical partition. In this manner, each logical partition is assured of seeing the hardware resource in its power-on reset state. For the specific example of an I/O slot, by cycling power to the I/O slot when it is removed from one logical partition and allocated to a different logical partition, the power on/power off mechanism assures that both data and configuration information from an I/O adapter plugged into the slot are purged when allocating the I/O slot to a different logical partition. In addition, hardware resources may be transferred temporarily from their logical partitions to the resource and partition manager in order to perform hardware service operations on a hardware resource or the components of the physical enclosure containing that hardware resource. When the hardware service is complete, the hardware resources are transferred back to their logical partitions. The locks and related mechanisms to transfer the lock between a partition and the resource and partition manager and back again facilitate such hardware service operations while the partitions continue operations.

Note that the term “hardware resource” as used in this specification denotes any whole or fractional portion of hardware in the computer system that may be independently allocated to a logical partition. Examples of hardware resources include: a physical I/O slot; a group of I/O slots in a physical enclosure; a portion of a processor; and a portion of memory. The preferred embodiments presented herein use the specific example of I/O slots as hardware resources that can be independently allocated to logical partitions. Note, however, that any hardware or portion of hardware that can be independently allocated to a logical partition falls within the scope of the term “hardware resource” as used herein.

Referring to FIG. 1, a computer system 100 is an enhanced IBM eServer iSeries computer system, and represents one suitable type of computer system that supports logical partitioning and resource allocation in accordance with the preferred embodiments. Those skilled in the art will appreciate that the mechanisms and apparatus of the present invention apply equally to any computer system that supports logical partitions. As shown in FIG. 1, computer system 100 comprises one or more processors 110 connected to a main memory 120, a mass storage interface 130, a display interface 140, a network interface 150, and a plurality of I/O slots 180. These system components are interconnected through the use of a system bus 160. Mass storage interface 130 is used to connect mass storage devices (such as a direct access storage device 155) to computer system 100. One specific type of direct access storage device is a CD RW drive, which may read data from a CD RW 195. Note that mass storage interface 130, display interface 140, and network interface 150 may actually be implemented in adapters coupled to I/O slots 180.

Main memory 120 contains a resource and partition manager 121, an I/O slot lock mechanism 122, a power on/power off slot mechanism 124, and N logical partitions 125, shown in FIG. 1 as logical partitions 125A through 125N. Resource and partition manager 121 preferably creates these N logical partitions 125. Each logical partition preferably includes a corresponding operating system 126, shown in FIG. 1 as operating systems 126A through 126N.

I/O slot lock mechanism 122 manages access to the I/O slots 180 by defining a plurality of slot locks 123, with one slot lock 123 preferably corresponding to each I/O slot 180. When an I/O slot needs to be allocated to a logical partition, the resource and partition manager checks the corresponding slot lock to see if the I/O slot is available. If the corresponding slot lock is owned by a different logical partition, the I/O slot is under the control of that logical partition. If the corresponding slot lock is owned by the resource and partition manager or unassigned, the I/O slot may be controlled by the resource and partition manager setting the corresponding slot lock and allocating the I/O slot to the requesting logical partition. In this manner, the slot locks 123 effectively serve as semaphores that indicate whether or not the corresponding I/O slot is available.
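The availability check described above can be illustrated with a minimal sketch. It is not the implementation of the preferred embodiments: the class name `SlotLockTable`, the owner identifier `MANAGER`, and the slot and partition names are hypothetical, and a real lock mechanism would also have to serialize concurrent requests.

```python
from typing import Dict, Iterable, Optional

# Hypothetical owner identifier for the resource and partition manager.
MANAGER = "resource_and_partition_manager"


class SlotLockTable:
    """One lock per I/O slot; the value records the current owner (None = unassigned)."""

    def __init__(self, slot_ids: Iterable[str]) -> None:
        self._owners: Dict[str, Optional[str]] = {slot: None for slot in slot_ids}

    def owner(self, slot_id: str) -> Optional[str]:
        return self._owners[slot_id]

    def allocate(self, slot_id: str, partition: str) -> bool:
        """Give the slot to `partition` if its lock is unassigned or held by the manager."""
        current = self._owners[slot_id]
        if current is None or current == MANAGER:
            self._owners[slot_id] = partition
            return True
        # Already owned by a partition: succeed only if it is the requester itself.
        return current == partition


locks = SlotLockTable(["slot_1", "slot_2"])
print(locks.allocate("slot_1", "partition_A"))  # lock was unassigned
print(locks.allocate("slot_1", "partition_B"))  # slot is under partition_A's control
```

In this sketch the lock value doubles as the ownership record, which mirrors the semaphore-like role the slot locks play in the description.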

Power on/power off slot mechanism 124 is used to assure that an I/O slot is powered down before the slot is removed from a logical partition, and to assure that a slot is powered up when the slot is allocated to a logical partition. In the prior art, an I/O slot may be removed from one logical partition and allocated to a different logical partition. However, performing this reallocation results in two possible problems. The first problem is an issue of data integrity. It is possible that data from a process running in a first logical partition may be retained in an I/O adapter plugged into an I/O slot when the I/O slot is reassigned to a different logical partition. In theory, one with sufficient skill could conceivably hack into that data from the second logical partition, which would compromise the data from the first logical partition. The second problem is that the new logical partition receiving the I/O slot does not know the current configuration of the I/O slot. In fact, because logical partitions act like different computer systems, a logical partition automatically assumes that an I/O adapter is in a power-on reset state when the I/O adapter is allocated to a logical partition. This is certainly a reasonable assumption in computer systems that are not logically partitioned. If an I/O adapter is to be physically transferred between two different computer systems, the I/O adapter will be unplugged from the first computer system and plugged into the second computer system. The result is that power is cycled on the I/O adapter during the transfer between computer systems, thereby clearing its data and placing the I/O adapter in a power-on reset state. The second computer system that received the I/O adapter knows that the I/O adapter is in a power-on reset state when the computer system first starts up. This assumption, however, does not hold in the case of a logically partitioned computer system. To the contrary, the prior art allows transferring I/O resources between partitions without performing any power off or power on cycle, thereby giving rise to the two problems discussed above. The power on/power off slot mechanism 124 solves this problem by assuring that power is always cycled on an I/O slot when the slot is removed from one logical partition and allocated to a different logical partition, and this is possible with disruption to operations affecting only that I/O slot and no others that may share the same physical enclosure. In this manner, each logical partition can correctly assume that an I/O adapter is in its power-on reset state when the logical partition first boots up, or when an active logical partition receives control of an I/O adapter.
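As a rough illustration of why cycling power addresses both problems, the following sketch models a slot whose adapter state is purged on removal and which is powered up again on allocation. The names `IOSlot`, `remove_slot`, and `allocate_slot`, and the dictionary standing in for adapter data and configuration, are assumptions made for this example only.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class IOSlot:
    owner: Optional[str] = None
    powered_on: bool = False
    # Stands in for data and configuration held by the adapter in the slot.
    adapter_state: Dict[str, str] = field(default_factory=dict)


def remove_slot(slot: IOSlot) -> None:
    """Power the slot down as it leaves its partition, purging adapter state."""
    slot.powered_on = False
    slot.adapter_state.clear()
    slot.owner = None


def allocate_slot(slot: IOSlot, partition: str) -> None:
    """Power the slot up so the new owner sees a power-on reset state."""
    if slot.owner is not None:
        raise RuntimeError("slot must be removed from its current owner first")
    slot.powered_on = True
    slot.owner = partition


slot = IOSlot()
allocate_slot(slot, "partition_A")
slot.adapter_state["dma_window"] = "configured by partition_A"
remove_slot(slot)                    # power off: data and configuration are purged
allocate_slot(slot, "partition_B")   # power on: partition_B sees a clean adapter
print(slot.owner, slot.adapter_state)
```

Because removal always clears the modeled adapter state, the second partition can neither read the first partition's data nor inherit its configuration.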

Operating system 126 is a multitasking operating system, such as OS/400, AIX, or Linux; however, those skilled in the art will appreciate that the spirit and scope of the present invention is not limited to any one operating system. Any suitable operating system can be used. Operating system 126 is a sophisticated program that contains low-level code to manage the resources of computer system 100. Some of these resources are processor 110, main memory 120, mass storage interface 130, display interface 140, network interface 150, system bus 160, and I/O slots 180. The operating system 126 in each partition may be the same as the operating system in other partitions, or may be a completely different operating system. Thus, one partition can run the OS/400 operating system, while a different partition can run another instance of OS/400, possibly a different release, or with different environment settings (e.g., time zone). The operating systems in the logical partitions could even be different than OS/400 (such as AIX or Linux), provided they are compatible with the hardware. In this manner the logical partitions can provide completely different computing environments on the same physical computer system.

The partitions 125A-125N are shown in FIG. 1 to reside within the main memory 120. However, one skilled in the art will recognize that a partition is a logical construct that includes resources other than memory. A logical partition typically specifies a portion of memory, along with an assignment of processor capacity and other system resources, such as I/O slots 180. Thus, one partition could be defined to include two processors and a portion of memory 120, along with one or more I/O processors that can provide the functions of mass storage interface 130, display interface 140, network interface 150, or interfaces to I/O devices plugged into I/O slots 180. Another partition could then be defined to include three other processors, a different portion of memory 120, and one or more I/O processors. The partitions are shown in FIG. 1 to symbolically represent logical partitions, which would include system resources outside of memory 120 within computer system 100. Note also that the resource and partition manager 121, the I/O slot lock mechanism 122, and the power on/power off slot mechanism 124 preferably reside in memory and hardware separate from the partitions and are facilities and mechanisms that are not directly available to the partitions. In the alternative, I/O slot lock mechanism 122 and power on/power off slot mechanism 124 could reside in any of the defined partitions in the computer system 100, or even on a computer system 175 coupled to computer system 100 via network 170.

Computer system 100 utilizes well known virtual addressing mechanisms that allow the programs of computer system 100 to behave as if they only have access to a large, single storage entity instead of access to multiple, smaller storage entities such as main memory 120 and DASD device 155. Therefore, while resource and partition manager 121 and the partitions 125A-125N are shown to reside in main memory 120, those skilled in the art will recognize that these items are not necessarily all completely contained in main memory 120 at the same time. It should also be noted that the term “memory” is used herein to generically refer to the entire virtual memory of computer system 100.

Processor 110 may be constructed from one or more microprocessors and/or integrated circuits. Processor 110 executes program instructions stored in main memory 120. Main memory 120 stores programs and data that processor 110 may access. When computer system 100 starts up, processor 110 initially executes the program instructions that make up the resource and partition manager 121, which initializes the operating systems in the logical partitions.

Although computer system 100 is shown to contain only a single system bus, those skilled in the art will appreciate that the present invention may be practiced using a computer system that has multiple buses. In addition, the I/O interfaces that are used in the preferred embodiment each may include separate, fully programmed microprocessors that are used to off-load compute-intensive processing from processor 110, as in iSeries input/output processors, or may be simple industry standard I/O adapters (IOAs).

Display interface 140 is used to directly connect one or more displays 165 to computer system 100. These displays 165, which may be non-intelligent (i.e., dumb) terminals or fully programmable workstations, are used to allow system administrators and users to communicate with computer system 100. Note, however, that while display interface 140 is provided to support communication with one or more displays 165, computer system 100 does not necessarily require a display 165, because all needed interaction with users and other processes may occur via network interface 150.

Network interface 150 is used to connect other computer systems and/or workstations (e.g., 175 in FIG. 1) to computer system 100 across a network 170. The present invention applies equally no matter how computer system 100 may be connected to other computer systems and/or workstations, regardless of whether the network connection 170 is made using present-day analog and/or digital techniques or via some networking mechanism of the future. In addition, many different network protocols can be used to implement a network. These protocols are specialized computer programs that allow computers to communicate across network 170. TCP/IP (Transmission Control Protocol/Internet Protocol) is an example of a suitable network protocol.

At this point, it is important to note that while the present invention has been and will continue to be described in the context of a fully functional computer system, those skilled in the art will appreciate that the present invention is capable of being distributed as a program product in a variety of forms, and that the present invention applies equally regardless of the particular type of computer readable signal bearing media used to actually carry out the distribution. Examples of suitable signal bearing media include: recordable type media such as floppy disks and CD RW (e.g., 195 of FIG. 1), and transmission type media such as digital and analog communications links.

FIG. 1 shows a sample computer system that shows some of the salient features of both hardware and software in accordance with the preferred embodiments. Now we present a more detailed implementation in FIGS. 2 and 3. FIG. 2 is a hardware diagram of a computer system that supports logical partitions and I/O resource allocation in accordance with the preferred embodiments. One physical enclosure 210 contains one or more CPUs 110 and memory 120 coupled together via system bus 160. A second enclosure 220 is an enclosure that houses I/O components coupled to a bus 212 that is coupled to system bus 160. We assume for this particular example that PCI components are the I/O components contained within enclosure 220. PCI host bridges 230 are coupled to bus 212, and provide an interface to multiple PCI to PCI bridges 240. In FIG. 2, there are two PCI host bridges 230A and 230B. PCI host bridge 230A provides an interface to four PCI to PCI bridges 240A-240D, while PCI host bridge 230B provides an interface to four PCI to PCI bridges 240E-240H. Each PCI to PCI bridge 240 connects to a single PCI adapter slot 250. Thus, PCI to PCI bridge 240A is coupled to a corresponding PCI adapter slot 250A; PCI to PCI bridge 240B is coupled to a corresponding PCI adapter slot 250B, and so on through PCI to PCI bridge 240H which is coupled to a corresponding PCI adapter slot 250H.
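The bridge-to-slot arrangement described for FIG. 2 can be written down as a small data structure. The sketch below simply reuses the reference numerals as string keys; the function name `build_topology` and its parameters are hypothetical and serve only to make the one-bridge-per-slot mapping explicit.

```python
from string import ascii_uppercase
from typing import Dict


def build_topology(host_bridges: int = 2, bridges_per_host: int = 4) -> Dict[str, Dict[str, str]]:
    """Map each PCI host bridge to its PCI to PCI bridges, and each of those to one slot."""
    topology: Dict[str, Dict[str, str]] = {}
    suffixes = iter(ascii_uppercase)  # A, B, C, ... shared by bridge and slot labels
    for h in range(host_bridges):
        host = "230" + ascii_uppercase[h]
        topology[host] = {}
        for _ in range(bridges_per_host):
            suffix = next(suffixes)
            # Each PCI to PCI bridge 240x connects to exactly one adapter slot 250x.
            topology[host]["240" + suffix] = "250" + suffix
    return topology


topology = build_topology()
for host, bridges in topology.items():
    print(host, bridges)
```

With the default arguments, host bridge 230A maps to bridges 240A through 240D and host bridge 230B to bridges 240E through 240H, matching the configuration described above.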

Each PCI host bridge 230 connects to the PCI to PCI bridges 240 via a primary PCI bus 260. FIG. 2 shows two primary PCI busses 260A and 260B. The PCI to PCI bridges 240 in turn connect to the PCI adapter slots 250 via a secondary PCI bus 270. FIG. 2 shows eight secondary PCI busses 270, namely 270A-270H that are coupled to their corresponding PCI adapter slots 250A-250H. PCI adapter slots 250 may be either connectors that receive a PCI adapter card, or PCI adapter chips embedded directly on the electronic substrate that contains the corresponding PCI to PCI bridge 240 or PCI host bridge 230. The logical partition operating systems “bind” CPU addresses to the PCI adapter memory for memory-mapped I/O from the CPU to the adapters, and bind memory addresses to the adapter, to enable the adapter to perform direct memory access (DMA) operations to and from the mapped memory addresses.

In the preferred embodiments, the presence of PCI to PCI bridges 240 between the PCI host bridge 230 and the PCI adapter slots 250 provides signaling and adapter binding isolation between the individual PCI adapters in the PCI adapter slots 250 and the PCI host bridge 230, CPUs 110 and memory 120. This isolation facilitates assignment of individual PCI adapter slots to different logical partitions, such that these partitions can share the platform hardware connected in common to the PCI to PCI bridges 240, but the operation of adapters assigned to other partitions does not disrupt the operation of an adapter assigned to a particular partition, and the adapter address bindings are enforced so that no partition or adapter can use another partition-adapter binding. Note that other methods of isolation that enable slot-level allocations and binding are within the scope of the preferred embodiments, such as associating each slot with a single PCI host bridge.

The power on/power off slot mechanism 124 shown in FIG. 1 preferably controls slot power control hardware in either each PCI host bridge 230 or in each PCI to PCI bridge 240. As discussed above with reference to FIG. 1, the power on/power off slot mechanism 124 can apply or remove electrical power to a particular slot 250 independent of the state of power to other I/O components of the platform, including other slots. In the most preferred embodiment, there is power on/power off control hardware in each PCI to PCI bridge 240 subject to power on/power off slot mechanism 124 that controls power to its corresponding slot 250. Thus, for the configuration shown in FIG. 2, PCI to PCI bridge 240A includes power on/power off hardware that controls power to slot 250A; PCI to PCI bridge 240B includes power on/power off hardware that controls power to slot 250B; and so on for each PCI to PCI bridge 240. Thus, for the system in FIG. 2, each of the PCI to PCI bridges 240A through 240H will have power on/power off hardware that controls power to their respective slots 250A-250H and that is controlled by power on/power off slot mechanism 124. Note that power on/power off hardware may not necessarily physically power down a slot. The preferred embodiments expressly extend to any method for placing a slot and its associated adapter in a power-on reset state. For example, some adapters may be embedded on a printed circuit board without components to individually control the power to the adapters. In this case, power on/power off hardware could place the adapter in a power-on reset state by flushing all of its data and placing the adapter in the same state as when it initially powers up, without physically cycling power to the adapter.

The configuration shown in FIG. 2 separates the platform electronics into one enclosure 210 containing the CPUs 110 and the memory 120, and the PCI I/O hardware components (e.g., 230, 240, 250, 260 and 270) into a separate enclosure 220. This is a common type of separation that is known in the art. Note, however, that in a small computer system it is common to have all elements in FIG. 2 contained in a single enclosure. In larger systems, there may be many CPUs and memory cards, and many PCI adapter slots requiring more PCI host bridges 230 and PCI to PCI bridges 240, so that the electronic packaging technologies require multiple electronic enclosures to contain these hardware elements. The preferred embodiments expressly extend to any suitable hardware configuration, whether all contained in a single enclosure or distributed among multiple enclosures.

In the preferred embodiments, it may be desirable to perform hardware service on components of the enclosure 220, such as power supplies, I/O slots, or other components of the enclosure that may require removing electrical power from all elements of that enclosure. In the preferred embodiments, this is accomplished by first transferring control of the I/O slots within that enclosure from their assigned logical partitions to the resource and partition manager, then powering off the enclosure and performing the hardware service, powering on the enclosure, and then transferring the I/O slots back to their assigned logical partitions. The I/O slot locks and related mechanisms for transferring the locks between logical partitions and the resource and partition manager facilitate this sequence of operations while the logical partitions continue operating. Note that the resource and partition manager may operate in conjunction with a hardware manager to perform these hardware management functions.
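The service sequence just described can be sketched as an ordered series of steps. This is only an illustration under simplifying assumptions: lock ownership is a plain dictionary, the name `HYPERVISOR` and the function `service_enclosure` are hypothetical, and powering the enclosure off and on is represented by log entries rather than hardware operations.

```python
from typing import Callable, Dict, List

# Hypothetical owner identifier for the resource and partition manager.
HYPERVISOR = "hypervisor"


def service_enclosure(
    lock_owners: Dict[str, str],
    enclosure_slots: List[str],
    perform_service: Callable[[], None],
) -> List[str]:
    """Move the enclosure's slot locks to the manager, service it, then hand them back."""
    steps: List[str] = []
    # Remember which partition each slot belongs to so it can be returned later.
    previous = {slot: lock_owners[slot] for slot in enclosure_slots}
    for slot in enclosure_slots:
        lock_owners[slot] = HYPERVISOR
    steps.append("locks transferred to hypervisor")
    steps.append("enclosure powered off")
    perform_service()
    steps.append("service performed")
    steps.append("enclosure powered on")
    for slot, owner in previous.items():
        lock_owners[slot] = owner
    steps.append("locks returned to partitions")
    return steps


owners = {"250A": "partition_1", "250B": "partition_2", "250E": "partition_3"}
for step in service_enclosure(owners, ["250A", "250B"], lambda: None):
    print(step)
```

Slots outside the serviced enclosure (here the hypothetical "250E") keep their owner throughout, which reflects the point that the partitions themselves continue operating.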

FIG. 3 is a block diagram showing specific software components that could implement the invention within the scope of the preferred embodiments. Note that the software components shown in FIG. 3 would preferably execute on a hardware platform such as computer system 200 shown in FIG. 2. N logical partitions 125A-125N are shown executing their respective operating systems 126A-126N. A hypervisor 300 is shown as one particular implementation of resource and partition manager 121 in FIG. 1. Hypervisor 300 includes a hypervisor partition 310 that runs an operating system kernel 312. The operating system kernel 312 is dispatchable and relocatable, and provides typical functions of operating system kernels, such as multitasking and memory management. The hypervisor partition 310 executes much as other logical partitions but differs from other logical partitions in that it is something of a private, or hidden, partition that does not provide for user applications and that has special authorities to control platform resources and is the only partition authorized to communicate with non-dispatchable hypervisor 320 via the HvPrimaryCall interface 330. The hypervisor partition 310 in FIG. 3 may correspond to a partition 125 in FIG. 1, which means the operating system kernel 312 also corresponds to an operating system 126 in FIG. 1. In current iSeries implementations, the hypervisor partition 310 could be called a “primary partition”. The HvPrimaryCall interface 330 is used by the hypervisor partition 310 to invoke hypervisor functions performed with the processor in the privileged, non-dispatchable hypervisor mode.

The logical partitions communicate with the hypervisor via an HvCall interface 340, which is used by logical partitions to invoke the privileged, non-dispatchable hypervisor 320. The non-dispatchable hypervisor 320 is a supervisory agent that is non-dispatchable and non-relocatable; it functions by accessing physical addresses. The non-dispatchable hypervisor 320 provides privilege mode functions that are invoked in any of three ways: 1) through the HvPrimaryCall interface 330 while the hypervisor partition is scheduling or dispatching logical partition execution; 2) through platform hardware interrupts; and 3) from a logical partition using processor supervisory-call instructions defined by the HvCall interface 340 that place the logical partition execution thread into a hypervisor execution (i.e., privileged) mode.

The hypervisor hardware manager 350 and I/O slot locks 123 are preferably encapsulated functions within the non-dispatchable hypervisor 320, as shown in FIG. 3, but could be implemented in different locations as well. The hypervisor hardware manager 350 encapsulates the hypervisor functions to access and control the PCI host bridge 230 and PCI to PCI bridge 240 hardware in FIG. 2, and to track and enforce the hardware states of the PCI adapter slots 250. The slot locks 123 encapsulate the functions to set the ownership of the lock and to serialize transfer of a slot lock between the hypervisor and logical partitions.

The hypervisor partition 310 interacts with the non-dispatchable hypervisor 320 to effect slot state and slot lock transition. Hypervisor partition 310 is an agent of the system administrator interface 360, and performs logical partition configuration and platform service operations requested by the system administrator 370 through that interface. Note that system administrator 370 preferably includes an administration console 372 and a hardware management console 374.

In order for the non-dispatchable hypervisor 320 to initiate communications with functions in the hypervisor partition 310, the non-dispatchable hypervisor 320 enqueues messages to an event message queue 314 monitored by the hypervisor partition 310. In general, event messages from the non-dispatchable hypervisor 320 to the dispatchable hypervisor partition 310 are used to perform complex hardware control sequences, such as resetting and initializing bridge hardware, scanning virtual address translation tables, and performing real time delays associated with hardware settle times. Functions in the hypervisor partition 310 call the HvPrimaryCall interface 330 to signal completion of operations the non-dispatchable hypervisor 320 has requested, to synchronize these hardware states with the non-dispatchable hypervisor functions.
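The request and completion handshake just described can be pictured with a small sketch. It is illustrative only and assumes a simple in-memory queue; the class name `NonDispatchableHypervisor`, the method names, and the `handlers` table are hypothetical and do not appear in the specification.

```python
import queue


class NonDispatchableHypervisor:
    """Hypothetical stand-in for non-dispatchable hypervisor 320."""

    def __init__(self):
        self.event_queue = queue.Queue()  # plays the role of event message queue 314
        self.completed = []               # completions signaled back by the HV partition

    def request(self, operation, slot):
        # Enqueue a hardware control sequence for the hypervisor partition.
        self.event_queue.put((operation, slot))

    def hv_primary_call_complete(self, operation, slot):
        # Stand-in for an HvPrimaryCall that signals completion.
        self.completed.append((operation, slot))


def hypervisor_partition_poll(hv, handlers):
    """Drain the queue, run each requested sequence, then signal completion."""
    while True:
        try:
            operation, slot = hv.event_queue.get_nowait()
        except queue.Empty:
            return
        handlers[operation](slot)  # e.g. reset bridge, wait for settle time
        hv.hv_primary_call_complete(operation, slot)
```

In this sketch the hypervisor partition never acts on hardware unless a message was queued for it, and the non-dispatchable side learns of completion only through the explicit callback, which mirrors the synchronization role described above.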

An I/O slot is typically assigned to a logical partition as part of configuring the platform resources to be used by the logical partition. However, at any given time, system administrator functions may initiate the transfer of an I/O slot from a logical partition using that slot to another logical partition, or may simply remove that slot from the logical partition's configuration, while the logical partition is active. Similarly, system service functions may require transfer of slot control from an active logical partition to the hypervisor or a service agent to perform a service function, such as servicing that slot individually or servicing other hardware within the same enclosure when that service cannot be performed without disrupting that slot or other slots in the enclosure.

The slot lock of the preferred embodiments facilitates dynamic transfer of control of an I/O slot between a logical partition operating system and the hypervisor, or between a controlled and an uncontrolled or unassigned state, without removing the I/O slot from the configuration database for the logical partition. The slot lock may be assigned to a logical partition, to the hypervisor, or may be unassigned to any entity (including logical partitions and hypervisor). The slot lock not only provides mutual exclusion between the hypervisor and logical partitions, it also provides a synchronization point to enforce the power and reset state of a slot, and to remove OS bindings between OS virtual address space and the adapter PCI memory or I/O spaces (memory mapped bindings) and between OS storage and adapter DMA mappings to that storage (e.g., indirect addresses in PCI memory space that translate to storage addresses in logical partition memory).
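As a rough illustration of the three ownership states and of the distinction between configured ownership and current control, consider the following sketch. The type and field names (`OwnerKind`, `Owner`, `SlotRecord`, `configured_partition`) are assumptions made for this example and are not taken from the specification.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class OwnerKind(Enum):
    UNASSIGNED = "unassigned"
    HYPERVISOR = "hypervisor"
    PARTITION = "partition"


@dataclass(frozen=True)
class Owner:
    kind: OwnerKind
    partition_id: Optional[int] = None  # meaningful only for PARTITION


UNASSIGNED = Owner(OwnerKind.UNASSIGNED)
HV = Owner(OwnerKind.HYPERVISOR)


def lp(partition_id: int) -> Owner:
    """Build an owner value naming one logical partition."""
    return Owner(OwnerKind.PARTITION, partition_id)


class SlotRecord:
    """Keeps configuration-database ownership separate from the slot lock."""

    def __init__(self, configured_partition: int):
        self.configured_partition = configured_partition  # survives lock transfers
        self.lock_owner = UNASSIGNED                       # who controls the slot now
```

Because `lock_owner` changes independently of `configured_partition`, a slot can pass to the hypervisor for service and later return to the same partition without any change to the partition's configuration.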

A logical partition operating system may use a function referred to herein as Vary Off to release control of a slot to the hypervisor or to an unassigned state, and may use a function referred to herein as Vary On to receive control of a slot from the hypervisor. The general concepts and methods of logical partition operating systems interacting with a hypervisor to vary off and vary on a PCI adapter slot have been implemented in the AS/400 and eServer (iSeries and pSeries) computer systems that provide logical partitioning. The preferred embodiments use the new feature of slot locks to control the Vary On and Vary Off processes, and to provide a synchronization point for the enforcement of slot power, reset, and operating system binding states. In addition, the preferred embodiments also provide an apparatus and method for preparing I/O slots for use and for transferring I/O slots between the hypervisor and logical partitions using a slot lock in relation to a logical partition Power On operation, a logical partition Power Off operation, and a logical partition Reboot operation.

The function of the hypervisor 300 is shown in more detail in the flow diagrams of FIGS. 4-8. FIG. 4 shows a flow diagram of a method 400 for rebooting a logical partition. Note that the hypervisor 300 may decide to reboot a logical partition, or the logical partition itself may signal the hypervisor 300 that it is shutting down and should be rebooted. When the hypervisor 300 of FIG. 3 needs to reboot a logical partition, the hypervisor partition 310 signals a reboot to the operating system 126 in the logical partition (step 1a). Thus, if hypervisor 300 wants to reboot logical partition 125A in FIG. 3, the hypervisor (HV) partition 310 signals the logical partition operating system 126A to shut down for reboot. The logical partition operating system 126A performs housekeeping chores to prepare its I/O adapters and to clean up for shutdown, and then signals the HV partition 310 to initiate reboot (step 1b). If the partition itself determines that it needs to reboot, it performs its housekeeping chores to prepare its I/O adapters and to clean up for shutdown, and then signals the HV partition 310 to initiate reboot (step 1b) without the hypervisor requesting a reboot in step 1a. The HV partition 310 then stops the execution of the logical partition CPUs (step 2), terminating the logical partition operating system. In a normal shutdown, the LP OS 126 completes its housekeeping chores before shutting down. However, if the logical partition has crashed, it may be unable to complete any of the housekeeping chores before shutting down.

The HV partition 310 then calls the setSlotLock function (step 3) on the HvPrimaryCall interface 330. This transfers control of a slot that is currently under the control of the partition being rebooted to the hypervisor. Three parameters are passed with the setSlotLock call, namely: slot, from_LP, to_HV. The slot parameter specifies the slot of interest. The from_LP parameter specifies the logical partition that is being rebooted (which currently controls the slot), and the to_HV parameter specifies that the slot lock is being transferred to be controlled by the hypervisor 300. In executing the setSlotLock call, the hypervisor performs step 4, which gets a multiprocessor thread lock on the slot lock storage. In this specific implementation, this means that no other multiprocessor CPU threads can access any slot lock while the slot lock storage is locked. However, it is equally within the scope of the preferred embodiments to provide slot locks that may be individually locked instead of locking the entire slot lock storage. The status of the slot lock is then checked to see if it is currently owned by the logical partition being rebooted (lock[slot]=LP). If the slot lock is owned by the logical partition being rebooted, ownership of the slot lock is transferred to the hypervisor (lock[slot]=HV), and the return status is set to SUCCESS. If the slot lock is not owned by the logical partition being rebooted, the return status is set to FAIL. The multiprocessor (MP) thread lock is then released on the slot lock storage. Next, the hypervisor interacts with the hypervisor hardware manager (HV HW MGR) 350 (step 5). If the slot lock status from step 4 is SUCCESS, the slot I/O and control authority are passed to the hypervisor, and the logical partition bindings to the slot are removed and disabled so that subsequent attempts by this or other logical partitions, or by the I/O adapter in the slot, to establish or utilize such bindings will fail.
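The check-and-transfer performed under the multiprocessor thread lock can be sketched as follows. This is a minimal illustration rather than the actual hypervisor code: it assumes owners are represented as plain strings, uses a single `threading.Lock` to stand in for the lock on the entire slot lock storage, and the names `SlotLockTable` and `set_slot_lock` are hypothetical.

```python
import threading

HV = "HV"
UNASSIGNED = "UNASSIGNED"
SUCCESS = "SUCCESS"
FAIL = "FAIL"


class SlotLockTable:
    """Hypothetical slot lock storage guarded by one thread lock."""

    def __init__(self, num_slots):
        self._mp_lock = threading.Lock()        # guards the whole table
        self._owner = [UNASSIGNED] * num_slots  # lock[slot] in the text

    def set_slot_lock(self, slot, from_owner, to_owner):
        """Transfer the slot lock only if from_owner currently holds it."""
        with self._mp_lock:
            if self._owner[slot] != from_owner:
                return FAIL
            self._owner[slot] = to_owner
            return SUCCESS

    def owner(self, slot):
        with self._mp_lock:
            return self._owner[slot]
```

For example, after `table.set_slot_lock(0, "LP1", HV)` succeeds, repeating the same call returns FAIL because the lock is no longer held by `"LP1"`. Guarding the whole table with one lock follows the specific implementation described above; a per-slot lock would be the alternative the text also contemplates.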

The HV partition 310 then calls the HvPrimaryCall interface 330 in steps 6-9. In step 6, a call is made to reset the PCI to PCI bridge for the slot, and to assert a reset signal PCI RST to the slot. In step 7, a call is made to re-initialize the PCI to PCI bridge. In step 8, a call is made to power on the slot. Note that step 8 could be omitted because the slot was reset in step 6, not powered down. Step 8 is typically performed after the hypervisor partition has performed reboot processing not related to I/O slots. In step 9, a setSlotLock call is made that specifies the slot of interest and that ownership is being transferred from the hypervisor (from_HV) to the logical partition (to_LP). Next, the hypervisor attempts to assign the slot lock to the logical partition being rebooted (step 10). First, a multiprocessor thread lock is obtained on the slot lock to prevent other threads from attempting to get the slot lock at the same time. If the slot lock is owned by the hypervisor, the slot lock is set to the logical partition being rebooted, and the status is set to SUCCESS. If the slot lock is not owned by the hypervisor, the status is set to FAIL. The multiprocessor thread lock is then released. Next, in step 11, HvPrimaryCall 330 determines the status of the slot lock. If the slot lock status is SUCCESS, the hypervisor hardware manager 350 is invoked, setting slot I/O and control authority to the logical partition, reinitializing the slot control hardware, and enabling the logical partition bindings to the slot.

Note that steps 3-11 are performed for each slot assigned to the logical partition. Once steps 3-11 have been performed for all I/O slots, execution of the logical partition operating system 126 may commence (step 12). In the preferred embodiments, some of these steps may be serialized while others may be performed in parallel for different slots. For example, slots may be transferred to the hypervisor one at a time. All slots may then be reset and initialized with a high degree of parallelism. Once partition reboot processing is done, slots are powered on, again with a high degree of parallelism. Finally, the slots are allocated back to the partition one at a time.
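One way to picture this mix of serialized and parallel phases is the sketch below. It assumes the four per-slot operations are supplied as callables and uses a thread pool for the parallel phases; the function name `reboot_slots` and its parameters are hypothetical, and a real hypervisor would not be structured this way.

```python
from concurrent.futures import ThreadPoolExecutor


def reboot_slots(slots, take, reset_and_init, power_on, give_back):
    """Run the four reboot phases over all slots of one partition."""
    # Phase 1: transfer each slot lock to the hypervisor, one at a time.
    for slot in slots:
        take(slot)
    # Phase 2: reset and re-initialize all slots in parallel.
    with ThreadPoolExecutor() as pool:
        list(pool.map(reset_and_init, slots))
    # Phase 3: power on all slots in parallel.
    with ThreadPoolExecutor() as pool:
        list(pool.map(power_on, slots))
    # Phase 4: allocate the slots back to the partition, one at a time.
    for slot in slots:
        give_back(slot)
```

Leaving each `with` block waits for every task in that phase, so no slot is powered on before all slots are reset, and no slot is returned to the partition before all are powered on.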

FIG. 5 shows a method 500 for powering off a logical partition within the scope of the preferred embodiments. Powering off a logical partition may occur in response to a request from a system administrator 510 (step 1a), or in response to a scheduled power down (step 1b). The logical partition operating system 126 is signaled to power down (step 2a), followed by a signal to the hypervisor when the logical partition is prepared (step 2b). As in the case of reboot, the logical partition performs housekeeping and I/O adapter preparation prior to signaling the hypervisor that it is ready to be shut down. Following step 2b, the hypervisor stops the execution of the logical partition CPUs (step 3), terminating the logical partition operating system to establish its logically powered off state. The setSlotLock function is then invoked on the HvPrimaryCall interface 330 (step 4), specifying three parameters: slot, which specifies the slot of interest; from_LP, which specifies the logical partition that the slot lock ownership is being transferred from; and to_HV, which specifies that the slot lock ownership is being transferred to the hypervisor. In response, the HvPrimaryCall interface 330 interacts with the slot lock 123 as shown in step 5. First, it gets a multiprocessor thread lock on the slot lock. If the slot lock is currently owned by the logical partition being powered down (lock[slot]=LP), the slot lock is reassigned to the hypervisor (lock[slot]=HV), and the return status is set to SUCCESS. If the slot lock is not currently owned by the logical partition being powered down, the return status is set to FAIL. After the return status is set to its appropriate value, the multiprocessor thread lock is released.

If the slot lock status is SUCCESS, hypervisor hardware manager 350 is next invoked in step 6. The slot control hardware and logical partition bindings to the slot are reset, and the slot I/O and control authority are transferred to the hypervisor. Once step 6 is complete, step 7 resets the bridge hardware for the slot and asserts the reset signal PCI RST to the slot, and step 8 sets the slot power off. Step 9 is a call to transfer the slot lock from the hypervisor to make the slot lock unassigned, which is implemented in step 10. Note that steps 4-10 are performed for each slot assigned to the logical partition being powered down.
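The per-slot power off sequence of steps 4-10 might be sketched as follows. The sketch assumes slot locks are held in a dictionary keyed by slot number and records hardware actions in a list instead of touching hardware; the names `transfer` and `power_off_partition_slots` are hypothetical. Skipping a slot whose first transfer fails is an assumption of this example, since the text states only that step 6 follows a SUCCESS status.

```python
import threading

HV, UNASSIGNED = "HV", "UNASSIGNED"
_mp_lock = threading.Lock()  # stands in for the multiprocessor thread lock


def transfer(locks, slot, from_owner, to_owner):
    """Move the slot lock only if it is currently held by from_owner."""
    with _mp_lock:
        if locks.get(slot) != from_owner:
            return False
        locks[slot] = to_owner
        return True


def power_off_partition_slots(locks, slots, partition, hw_log):
    """Return a per-slot result after running steps 4-10 for each slot."""
    results = {}
    for slot in slots:
        if not transfer(locks, slot, partition, HV):           # steps 4-5
            results[slot] = False                              # assumption: skip slot
            continue
        hw_log.append((slot, "reset_bindings"))                # step 6
        hw_log.append((slot, "assert_pci_rst"))                # step 7
        hw_log.append((slot, "power_off"))                     # step 8
        results[slot] = transfer(locks, slot, HV, UNASSIGNED)  # steps 9-10
    return results
```

The slot lock passes through the hypervisor on its way to the unassigned state, so the bindings are removed and the slot is powered off while no partition can claim it.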

FIG. 6 shows a method 600 for powering on a logical partition in accordance with the preferred embodiments. The process begins when a system administrator 510 requests that a logical partition be powered on (step 1). Logical partition power on may also be a scheduled task of the HV partition 310. The HV partition 310 issues a setSlotLock call to transfer the slot from unassigned to the hypervisor (step 2). In response, step 3 is performed, which gets a multiprocessor thread lock, and if the slot lock is unassigned, it assigns the slot lock to the hypervisor. Only if the slot lock is successfully transferred to the hypervisor in step 3 are the remaining steps in FIG. 6 performed.

Bridge slot hardware for a selected slot is initialized (step 4), and the slot is then powered on (step 5). The slot reset signal PCI RST is then deasserted (step 6), which takes the slot out of its reset state and allows it to function. A setSlotLock call is then made (step 7) that specifies the slot of interest (slot), that the slot lock of interest currently belongs to the hypervisor (from_HV), and that ownership of the slot lock of interest is to be transferred to the logical partition being powered up (to_LP).

Next, step 8 is performed, which gets a multiprocessor thread lock on the slot lock, determines if the slot lock is currently owned by the hypervisor, and if so, allocates the slot lock to the logical partition and sets the return status to SUCCESS. If the slot lock is not owned by the hypervisor (for example, because it is already owned by a different partition), the return status is set to FAIL. Once the return status is set to its appropriate value, the multiprocessor thread lock is released.

Step 9 is then performed, which checks the slot lock status, and if it indicates SUCCESS, the slot I/O and control authority are set to the logical partition, the slot control hardware is initialized, and the bindings from the slot to the logical partition are enabled. Note that steps 2-9 are performed for each slot assigned to the target logical partition being powered up. Once steps 2-9 have been performed for each slot assigned to the target logical partition, the execution of the logical partition operating system 126 is started (step 10).
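The power on flow of FIG. 6 can be summarized in a short sketch, under the same illustrative assumptions as before: slot locks live in a dictionary, hardware actions are only logged, and the names `transfer` and `power_on_partition` are hypothetical. Starting the operating system even when some slot could not be handed over is an assumption of this example; the text does not say how that case is handled.

```python
import threading

HV, UNASSIGNED = "HV", "UNASSIGNED"
_mp_lock = threading.Lock()  # stands in for the multiprocessor thread lock


def transfer(locks, slot, from_owner, to_owner):
    """Move the slot lock only if it is currently held by from_owner."""
    with _mp_lock:
        if locks[slot] != from_owner:
            return False
        locks[slot] = to_owner
        return True


def power_on_partition(locks, slots, partition, hw_log):
    """Return the slots handed to the partition; the OS starts afterwards."""
    handed_over = []
    for slot in slots:
        if not transfer(locks, slot, UNASSIGNED, HV):  # steps 2-3
            continue  # remaining steps are skipped for this slot
        hw_log.append((slot, "init_bridge"))           # step 4
        hw_log.append((slot, "power_on"))              # step 5
        hw_log.append((slot, "deassert_pci_rst"))      # step 6
        if transfer(locks, slot, HV, partition):       # steps 7-8
            hw_log.append((slot, "enable_bindings"))   # step 9
            handed_over.append(slot)
    hw_log.append(("partition", "start_os"))           # step 10
    return handed_over
```

A slot already held by another partition fails the first transfer, so none of its hardware is touched.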

As described above, there are times when control of an I/O slot may be transferred from one logical partition to another without powering off or rebooting either partition. Similarly, there are times when control of an I/O slot may be transferred from a logical partition to the hypervisor for maintenance, also without powering off or rebooting that logical partition. The function “Vary Off” allows an active partition to dynamically relinquish control of a slot without relinquishing ownership of the slot in the platform partition configuration database. In similar fashion, the function “Vary On” allows an active partition to dynamically acquire control of a slot that it owns according to the platform partition configuration database. FIGS. 7 and 8 show the detailed flow diagrams for the vary off and vary on functions, respectively.

Referring now to FIG. 7, a method 700 for implementing the vary off function in accordance with the preferred embodiments begins when a system administrator 510 or other system manager (such as a workload manager application) sends a vary off message to a logical partition operating system 126 (step 1). In response, the logical partition operating system calls the HvCall interface 340 to assert the slot reset signal PCI RST to the slot (step 2). The hypervisor hardware manager 350 is then invoked to assert the PCI RST signal (step 3). In response, the hypervisor hardware manager 350 places a message on an event queue that allows the non-dispatchable hypervisor to communicate with the hypervisor partition (step 4). The HV partition 310 monitors the event queue for messages, and when it sees the message queued in step 4, it calls the HvPrimaryCall interface 330 to assert the PCI RST signal (step 5). The hypervisor hardware manager 350 is then invoked to assert the PCI RST signal (step 6). HV partition 310 delays for some period of time to allow the state of the hardware to settle. After waiting the appropriate delay, the HV partition 310 then signals the non-dispatchable hypervisor that PCI RST processing is complete (step 7). The completion of PCI RST processing is then signaled to the hypervisor hardware manager (step 8).

Next, the logical partition operating system 126 calls the HvCall interface 340 to request power off of the slot (step 9). This invokes the hypervisor hardware manager (step 10), which generates a power off slot event in the event queue to the HV partition 310 (step 11). Once the HV partition 310 sees the logical partition event “power off slot” on the event queue in step 11, it invokes the HvPrimaryCall interface 330 to power off the slot and reset the bridge hardware (step 12). This is then passed to the hypervisor hardware manager (step 13). The HV partition 310 again waits a predetermined delay to allow the hardware to settle, then calls the HvPrimaryCall interface 330 to signal the non-dispatchable hypervisor that slot power off processing is complete (step 14). This is then relayed to the hypervisor hardware manager (step 15). In this detailed implementation, the PCI RST signal is first asserted to reset the secondary bus under the PCI to PCI bridge corresponding to a slot, and then the bridge itself is reset, which isolates everything under the bridge. At this point, the slot is powered down.

The LP OS 126 also calls the HvCall interface 340 to release the OS bindings to the adapters (step 16). In response, the HvCall interface 340 calls the hypervisor hardware manager 350 to unbind the adapter mappings (step 17). An event message is then queued on the event queue to unbind the mappings for the slot (step 18). The HV partition 310 then unmaps the page table and DMA bindings for the adapter in this slot (step 19). The HV partition 310 then signals that the memory mappings have been unbound (step 20). The HvPrimaryCall 330 relays this to the hypervisor hardware manager (step 21).

The LP OS 126 calls the HvCall interface 340 to release control authority for the slot (step 22). In response, step 23 is performed. If the slot is powered off and the bindings are unmapped, the slot I/O and control authority is set to the hypervisor, and SUCCESS is returned. Otherwise, FAIL is returned. Then step 24 is performed. First, a multiprocessor thread lock is obtained. If the slot lock is currently owned by the logical partition (lock[slot]=LP) and step 23 indicated SUCCESS, the slot lock ownership is relinquished (lock[slot]=unassigned) and return status is set to SUCCESS. Otherwise, return status is set to FAIL. The multiprocessor thread lock is then released. The system administrator 510 or other agent that requested the vary off function in step 1 will then use the status returned from the sequence of steps 1 through 23 to determine whether the vary off function was successful. At this point the slot has been relinquished and can now be transferred to a different partition or taken over by a hardware service tool.
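The two-part check in steps 23 and 24 can be illustrated with a brief sketch. It assumes a simple per-slot record holding power and binding state; the `Slot` class, its field names, and `release_slot` are hypothetical and serve only to show the order of the checks.

```python
import threading
from dataclasses import dataclass

HV, UNASSIGNED = "HV", "UNASSIGNED"
_mp_lock = threading.Lock()  # stands in for the multiprocessor thread lock


@dataclass
class Slot:
    lock_owner: str
    authority: str
    powered_on: bool = True
    bindings_mapped: bool = True


def release_slot(slot, partition):
    """Vary off, final phase: return True only if the lock was relinquished."""
    # Step 23: authority moves to the hypervisor only for a quiesced slot.
    if slot.powered_on or slot.bindings_mapped:
        return False
    slot.authority = HV
    # Step 24: relinquish the slot lock under the thread lock.
    with _mp_lock:
        if slot.lock_owner != partition:
            return False
        slot.lock_owner = UNASSIGNED
        return True
```

A partition that tries to release a slot that is still powered on, or still has bindings mapped, is refused before the slot lock is examined, which is how the lock acts as a synchronization point for slot state.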

Method 800 of FIG. 8 shows steps performed in implementing the vary on function within the scope of the preferred embodiments. First, a system administrator or other system manager sends a vary on message to the logical partition operating system 126 (step 1). Next, the LP OS 126 calls the HvCall interface 340 to acquire control authority over the slot (step 2). In response, step 3 is performed, which gets a multiprocessor thread lock on the slot lock. If the slot lock is currently owned by the requesting partition or is unassigned, the slot lock is assigned to the logical partition, and the return status is set to SUCCESS. Otherwise, the return status is set to FAIL. The multiprocessor thread lock is then released.
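Step 3 differs from the earlier transfers in that it accepts two starting states. A minimal sketch, assuming slot locks are kept in a dictionary and using the hypothetical name `acquire_slot`, might look like this:

```python
import threading

HV, UNASSIGNED = "HV", "UNASSIGNED"
_mp_lock = threading.Lock()  # stands in for the multiprocessor thread lock


def acquire_slot(locks, slot, partition):
    """Vary on, step 3: claim the lock if it is unassigned or already ours."""
    with _mp_lock:
        if locks[slot] in (partition, UNASSIGNED):
            locks[slot] = partition
            return True
        return False
```

A request therefore fails while the hypervisor or another partition holds the lock, for instance during a service operation on the slot, and succeeds once the lock has been returned to the unassigned state.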

Assuming the status is SUCCESS for step 3, steps 4-20 may be performed. In step 4, slot I/O and control authority for the slot is set to the logical partition. The LP OS 126 calls the HvCall interface 340 to enable DMA and virtual address (VA) binding (step 5). This calls the hypervisor hardware manager (step 6). In response, the hypervisor hardware manager 350 enables the DMA and VA bindings for the adapter in the slot. The LP OS 126 also calls the HvCall interface 340 to power on the slot (step 7). This invokes the hypervisor hardware manager (step 8). In response, an event is placed on the event queue that requests that the slot be powered on (step 9). In response, the HV partition 310 calls the HvPrimaryCall interface 330 to initialize the bridge and to power on the slot (step 10). This invokes the hypervisor hardware manager (step 11). After waiting an appropriate delay to assure the slot is powered on and stable, a message is sent indicating that power on processing is complete (step 12). The completion of power on processing is then signaled to the hypervisor hardware manager (step 13).

The LP OS 126 calls the HvCall interface 340 to deassert the PCI RST signal to the slot (step 14). This invokes the hypervisor hardware manager (step 15). In response, an event message is written to the event queue requesting that the PCI RST signal be deasserted (step 16). An HvPrimaryCall is then made to deassert the PCI RST signal (step 17), which is passed to the hypervisor hardware manager (step 18). In response, the non-dispatchable hypervisor deasserts the PCI RST signal to the slot. After waiting an appropriate delay to assure the slot is out of reset and stable, the HV partition 310 calls the HvPrimaryCall interface 330 to indicate that PCI RST processing is complete (step 19). The completion of PCI RST processing is also communicated to the hypervisor hardware manager (step 20). At this point the logical partition has acquired control of the slot and can resume operations using the slot.

The preferred embodiments provide a significant advance over the prior art by providing slot locks that must be obtained before operations on the slot may be performed, and by assuring that a slot is powered off and then on again before control of that slot can be transferred between entities. A slot may be controlled by a logical partition, by the hypervisor, or may be unassigned. Note that various agents under control of the hypervisor, such as hardware managers, may control slots as well. The mutually exclusive slot locks assure non-conflicting access to slots by competing entities. Powering down a slot when it is removed from a logical partition eliminates the issue of data integrity for data in an adapter and assures the adapter is always in a power-on reset state when it is allocated to a logical partition.

One skilled in the art will appreciate that many variations are possible within the scope of the present invention. Thus, while the invention has been particularly shown and described with reference to preferred embodiments thereof, it will be understood by those skilled in the art that these and other changes in form and details may be made therein without departing from the spirit and scope of the invention. For example, while PCI slots are shown as an example of a specific type of resource that may be independently controlled, other types of resources besides PCI slots could also be controlled within the scope of the preferred embodiments. For example, various different types of I/O slots or adapters, such as PCMCIA slots, S390 channel or control units, etc. could be controlled using the teachings of the preferred embodiments. Other types of like resources that could be controlled in accordance with the preferred embodiments include I/O buses, I/O communication channels (such as Infiniband queue pairs), virtual slots or devices, CPUs, and memory blocks.

1. An apparatus comprising: at least one processor; a memory coupled to the at least one processor; a plurality of input/output (I/O) slots coupled to the at least one processor; a plurality of locks residing in the memory, wherein each of the plurality of I/O slots has a corresponding lock; a plurality of logical partitions defined on the apparatus; a lock mechanism that controls access to each I/O slot by the plurality of logical partitions by requiring exclusive ownership of the corresponding lock before transferring control of the corresponding I/O slot to one of the plurality of logical partitions and before allowing one of the plurality of logical partitions to access the corresponding I/O slot; a mechanism for releasing all memory and virtual address bindings to an adapter in a selected I/O slot when control of the selected I/O slot is removed from one of the plurality of logical partitions; a mechanism for one of the plurality of logical partitions to relinquish control of an I/O slot that the one logical partition owns without relinquishing ownership of the I/O slot; and a mechanism for the one logical partition to regain control of an I/O slot that the logical partition owns but for which the one logical partition previously relinquished control.
2. The apparatus of claim 1 further comprising a mechanism for enabling memory and virtual address bindings to the adapter in the selected I/O slot when control of the selected I/O slot is transferred to one of the plurality of logical partitions.
3. The apparatus of claim 1 further comprising a mechanism for transferring control of a selected I/O slot to a resource and partition manager when control of the selected I/O slot is removed from one of the plurality of logical partitions.
4. The apparatus of claim 1 further comprising a mechanism for transferring control of a selected I/O slot to one of the plurality of logical partitions when control of the selected I/O slot is transferred to the one logical partition.
5-6. (canceled)
7. A program product comprising: a lock mechanism that defines a plurality of locks, wherein each of a plurality of input/output (I/O) slots has a corresponding lock, the lock mechanism controlling access to the plurality of I/O slots in a computer system that includes a plurality of logical partitions by requiring exclusive ownership of the corresponding lock before transferring control of the corresponding I/O slot to one of the plurality of logical partitions and before allowing one of the plurality of logical partitions to access the corresponding I/O slot, the lock mechanism releasing all memory and virtual address bindings to an adapter in a selected I/O slot when control of the selected I/O slot is removed from one of the plurality of logical partitions; a second mechanism for one of the plurality of logical partitions to relinquish control of an I/O slot that the one logical partition owns without relinquishing ownership of the I/O slot; a third mechanism for the one logical partition to regain control of an I/O slot that the logical partition owns but for which the one logical partition previously relinquished control; and recordable media bearing the lock mechanism, the second mechanism, and the third mechanism.
8. The program product of claim 7 further comprising a mechanism residing on the recordable media for enabling memory and virtual address bindings to an adapter in a selected I/O slot when control of the selected I/O slot is transferred to one of the plurality of logical partitions.
9. The program product of claim 7 further comprising a mechanism residing on the recordable media for transferring control of a selected I/O slot to a resource and partition manager when control of the selected I/O slot is removed from one of the plurality of logical partitions.
10. The program product of claim 7 further comprising a mechanism residing on the recordable media for transferring control of a selected I/O slot to one of the plurality of logical partitions when control of the selected I/O slot is transferred to the one logical partition.
11-12. (canceled)